ABSTRACT

Instruction set simulators are critical tools for the explo- 
ration and validation of new programmable architectures. 
Due to increasing complexity of the architectures and time- 
to-market pressure, performance is the most important fea- 
ture of an instruction-set simulator. Interpretive simulators 
are flexible but slow, whereas compiled simulators deliver 
speed at the cost of flexibility. This paper presents a novel 
technique for generation of fast instruction-set simulators 
that combines the benefit of both compiled and interpre- 
tive simulation. We achieve fast instruction accurate simu- 
lation through two mechanisms. First, we move the time- 
consuming decoding process from run-time to compile time 
while maintaining the flexibility of interpretive simulation.
Second, we use a novel instruction abstraction technique to
generate aggressively optimized decoded instructions that
further improve simulation performance. Our instruction set
compiled simulation (IS-CS) technique delivers up to 40%
performance improvement over the best known published result
that has the flexibility of interpretive simulation. We
illustrate the applicability of the IS-CS technique
using the ARM7 embedded processor. 

Categories and Subject Descriptors 

I.6.5 [Simulation And Modeling]: Model Development;
I.6.7 [Simulation And Modeling]: Simulation Support
Systems

General Terms 

Design, Performance 
Keywords 

Compiled Simulation, Interpretive Simulation, Instruction 
Set Architectures, Instruction Abstraction 
1. INTRODUCTION

An instruction-set simulator is a tool that runs on a host 
machine to mimic the behavior of running an application 
program on a target machine. Instruction-set simulators are 
indispensable tools in the development of new programmable 
architectures. They are used to validate an architecture de- 
sign, a compiler design, as well as to evaluate architectural 
design decisions during design space exploration. 

Traditional interpretive simulation is flexible but slow. In 
this technique, an instruction is fetched, decoded, and
executed at run time as shown in Figure 1. Instruction decoding
is a time-consuming process in a software simulation.




Figure 1: Traditional Interpretive Simulation Flow 
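
The flow of Figure 1 can be sketched as follows. This is a minimal illustration for a hypothetical two-instruction toy target; the opcodes, `ToyState`, and `interpret` are assumptions made for this sketch and do not come from any simulator discussed in this paper. The point to note is that the decode step is repeated every time an instruction is executed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy target: 8-bit opcode in the top byte, 24-bit immediate below.
enum ToyOpcode : std::uint32_t { kAddImm = 0x01, kHalt = 0xFF };

struct ToyState {
    std::uint32_t acc = 0;   // single accumulator register
    std::size_t pc = 0;      // index of the next instruction word
    bool halted = false;
};

// Interpretive simulation: every iteration fetches, decodes, and executes.
inline void interpret(const std::vector<std::uint32_t>& program, ToyState& s) {
    while (!s.halted && s.pc < program.size()) {
        std::uint32_t word = program[s.pc++];       // fetch
        std::uint32_t opcode = word >> 24;          // decode, repeated at run time
        std::uint32_t imm = word & 0x00FFFFFFu;
        switch (opcode) {                           // execute
            case kAddImm: s.acc += imm; break;
            case kHalt:   s.halted = true; break;
            default:      s.halted = true; break;   // unknown opcode: stop
        }
    }
}
```

A program that is executed many times through a loop pays the decode cost on every pass, which is the overhead that compile-time decoding removes.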

Compiled simulation performs compile-time decoding of the
application program to improve simulation performance, as
shown in Figure 2. To improve the simulation speed further,
static compilation based techniques move the instruction
scheduling into the compilation phase [4]. However, all
compiled simulators rely on the assumption that the complete
program code is known before the simulation starts and is,
furthermore, run-time static. Due to this assumption, many
application domains are excluded from the utilization of
compiled simulators. For example, embedded systems that use
external program memories cannot use compiled simulators
since the program code is not predictable prior to run time.
Similarly, compiled simulators are not applicable to embedded
systems that use processors having multiple instruction sets.
These processors can switch to a different instruction set
mode at run time. For instance, the ARM processor uses the
Thumb (reduced bit-width) instruction set to reduce power and
memory consumption. This dynamic switching of instruction set
modes cannot be considered by a simulation compiler, since
the selection depends on run-time values and is not
predictable. Furthermore, applications with run-time dynamic
program code, as provided by operating systems (OS), cannot
be addressed by compiled simulators.




Figure 2: Traditional Compiled Simulation Flow 

Due to the restrictiveness of the compiled technique, in- 
terpretive simulators are typically used in embedded sys- 
tems design flow. This paper presents a novel technique 
for generation of fast instruction-set simulators that com- 
bines the performance of traditional compiled simulation 
with the flexibility of interpretive simulation. Our instruc- 
tion set compiled simulation (IS-CS) technique achieves high 
performance due to two reasons. First, the time consuming 
instruction decoding process is moved to compile time while 
maintaining the flexibility of interpretive simulation. In case 
an instruction is modified at run-time, the instruction is re- 
decoded prior to execution. Second, we use an instruction 
abstraction technique to generate aggressively optimized de- 
coded instructions that further improve simulation perfor- 
mance. The IS-CS technique delivers better performance 
than other published simulation techniques that have the 
flexibility of interpretive simulation. The simulation
performance of the IS-CS technique is up to 40% better than the
best known results [1] in this category.

The rest of the paper is organized as follows. Section 2 
presents related work addressing instruction-set simulation 
techniques. The instruction set compiled simulation (IS-CS) 
technique is presented in Section 3. Section 4 presents sim- 
ulation results using the ARM7 architecture, a commonly 
used embedded processor. Section 5 concludes the paper. 

2. RELATED WORK 

An extensive body of recent work has addressed instruction- 
set architecture simulation. The wide spectrum of today's 
instruction-set simulation techniques includes the most flex- 
ible but slowest interpretive simulation and faster compiled 
simulation. Recent research addresses retargetability of
instruction set simulators using a machine description language.

Simplescalar [3] is a widely used interpretive simulator 
that does not have any performance optimizations for func- 
tional simulation. 

Shade [5], Embra [10] and FastSim [8] simulators use dynamic
binary translation and result caching to improve simulation
performance. Embra provides the highest flexibility
with maximum performance but is not retargetable: it is
restricted to the simulation of the MIPS R3000/R4000
architecture.

A fast and retargetable simulation technique is presented 
in [6]. It improves traditional static compiled simulation by
aggressive utilization of the host machine resources. Such 
utilization is achieved by defining a low level code gener- 
ation interface specialized for ISA simulation, rather than 
the traditional approaches that use C as a code generation 
interface. 



Retargetable fast simulators based on an Architecture De- 
scription Language (ADL) have been proposed within the 
framework of FACILE [9], Sim-nML [12], ISDL [14], MI- 
MOLA [16], ANSI C [11], LISA ([1], [2], [4]), and EXPRES- 
SION [15]. The simulator generated from a FACILE de- 
scription utilizes the Fast Forwarding technique to achieve 
reasonably high performance. All of these simulation approaches
assume that the program code is run-time static.

In summary, none of the above approaches (except [1])
combines retargetability, flexibility, and high simulation
performance at the same time. A just-in-time cache compiled
simulation (JIT-CCS) technique is presented in [1]. The
objective of the JIT-CCS technique is similar to the one
presented in this paper: combining the full flexibility of
interpretive simulators with the speed of the compiled principle.
The JIT-CCS technique integrates the simulation compiler
into the simulator. The compilation of an instruction takes
place at simulator run time, just-in-time before the instruction
is going to be executed. Subsequently, the extracted
information is stored in a simulation cache for direct reuse
in a repeated execution of the program address. The simulator
recognizes if the program code of a previously executed
address has changed and initiates a re-compilation. This
technique makes an assumption to get performance closer
to compiled simulation: the number of repeatedly executed
instructions should be very large such that 90% of the
execution time is spent in 10% of the code. This assumption
may not hold true for all real world applications. For example,
the 176.gcc benchmark from SPEC CPU2000 violates
this rule.
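
The caching idea described above can be sketched as follows. This is only an illustration of decode-on-first-use with change detection, not the implementation of [1]; `JitDecodeCache`, `CachedDecode`, and the toy `decode` function are hypothetical names introduced for the sketch.

```cpp
#include <cstdint>
#include <unordered_map>

struct CachedDecode {
    std::uint32_t binary;   // instruction word the entry was decoded from
    int decodedOp;          // stand-in for the extracted decode information
};

struct JitDecodeCache {
    std::unordered_map<std::uint32_t, CachedDecode> entries;
    int decodeCount = 0;    // how many times the expensive decoder ran

    // Hypothetical decoder: here it simply returns the top byte of the word.
    int decode(std::uint32_t word) {
        ++decodeCount;
        return static_cast<int>(word >> 24);
    }

    // Returns the decoded information for the word at addr. The decoder
    // runs only on a miss or when the word at that address has changed.
    int lookup(std::uint32_t addr, std::uint32_t word) {
        auto it = entries.find(addr);
        if (it == entries.end() || it->second.binary != word) {
            CachedDecode fresh{word, decode(word)};
            entries[addr] = fresh;
            return fresh.decodedOp;
        }
        return it->second.decodedOp;
    }
};
```

In such a scheme every instruction is decoded at least once at run time, so the benefit depends on how often each cached entry is reused.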

We propose an instruction set compiled simulation (IS- 
CS) technique where the program is compiled prior to run 
time and executed interpretively as shown in Figure 3. The 
simulator recognizes if the program code of a previously ex- 
ecuted address has changed and initiates a re-decoding. We 
achieve both the performance of compiled simulation and 
flexibility of interpretive simulation. The simulation performance
of the IS-CS technique is up to 40% better than the
best known result [1] in this category. There are two reasons
for its superior performance. First, the time-consuming
instruction decoding process is moved to compile time
while maintaining the flexibility of interpretive simulation. 
Second, we use a novel instruction abstraction technique to 
generate aggressively optimized decoded instructions that 
further improve simulation performance. 

3. INSTRUCTION SET COMPILED SIMULATION

We developed the instruction set compiled simulation (IS- 
CS) technique with the intention of combining the full flex- 
ibility of interpretive simulation with the speed of the com- 
piled principle. The basic idea is to move the time-consuming 
instruction decoding to compile time as shown in Figure 3. 
The application program, written in C/C++, is compiled
using the gcc compiler configured to generate binary for the
target machine. The instruction decoder decodes one binary
instruction at a time to generate the decoded program
for the input application. The decoded program is compiled
by a C++ compiler and linked with the simulation library
to generate the simulator. The simulator recognizes if
a previously decoded instruction has changed and initiates
re-decoding of the modified instruction. If any instruction
is modified during execution and subsequently re-decoded,
the location in instruction memory is updated with the
re-decoded instruction. To improve the simulation speed we
use a novel instruction abstraction technique that generates
optimized decoded instructions as described in Section 3.1.
As a result, the computation during run time is minimized.
This technique achieves the speed of compiled simulation
due to compile-time decoding of the application as described in
Section 3.2. Section 3.3 describes the simulation engine that
offers the full flexibility of interpretive simulation.

Figure 3: Instruction Set Compiled Simulation Flow

3.1 Instruction Abstraction 

In traditional interpretive simulation (e.g., Simplescalar
[3]) the decoding and execution of binary instructions are
done using a single monolithic function. This function has
many if-then-else and switch/case statements that perform
certain activities based on bit patterns of opcode, operands,
addressing modes, etc. In advanced interpretive simulation
(e.g., LISA [1]) the binary instruction is decoded and the
decoded instruction contains pointers to specific functions.
There are many variations of these two methods based on
efficiency of decode, complexity of implementation, and
performance of execution. However, none of these techniques
exploits the fact that a certain class of instructions may have
a constant value for a particular field of the instruction. For
example, a majority of the ARM instructions execute
unconditionally (the condition field has the value Always). It is a
waste of time to check the condition for such instructions every
time they are executed.

Clearly, when certain input values are known for a class 
of instructions, the partial evaluation [13] technique can be 
applied. The effect of partial evaluation is to specialize a 
program with part of its input to get a faster version of the 
same program. To take advantage of such situations we need 
to have separate functions for each and every possible for- 
mat of instructions so that the function could be optimized 
by the compiler at compile time and produce the best per- 
formance at run time. Unfortunately, this is not feasible 
in practice. For example, consider the ARM data processing
instructions. They can have 16 conditions, 16 operations,
an update flag (true/false), and one destination followed by
two source operands. The second source operand, called the
shifter operand, has three fields: operand type (reg/imm),
shift options (5 types), and shift value (reg/imm). In total,
the ARM data processing instructions have
16 x 16 x 2 x 2 x 5 x 2 = 10240 possible formats.

Our solution to this problem is to define instruction classes, 
where each class contains instructions with similar formats. 
Most of the time this information is readily available from 
the instruction set architecture manual. For example, we
defined seven instruction classes for the ARM processor, viz.,
Data Processing, Branch, LoadStore, Multiply, Multiple
LoadStore, Software Interrupt, and Swap. Next, we define a set
of masks for each instruction class. The mask consists of '0',
'1' and 'x' symbols. A '0' ('1') symbol in the mask matches
with a '0' ('1') in the binary pattern of the instruction at the
same bit position. An 'x' symbol matches with both '0' and
'1'. For example, the masks for the data processing
instructions are shown below:

"xxxx-001x xxxx-xxxx xxxx-xxxx xxxx-xxxx"
"xxxx-000x xxxx-xxxx xxxx-xxxx xxx0-xxxx"

We use C++ templates to implement the functionality for 
each class of instructions. For example, the pseudo code for 
the data processing template is shown below. The template 
has four parameters, viz., condition, operation, update flag,
and shifter operand. The shifter operand is a template hav- 
ing three parameters viz., operand type, shift options and 
shift value. 

Example 1: Template for Data Processing Instructions 

template <class Cond, class Op, class Flag, class SftOper>
class DataProcessing : ...
{
    SftOper _sftOperand;
public:
    virtual void execute()
    {
        if (Cond::execute())
        {
            _dest = Op::execute(_src1, _sftOperand.getValue());
            if (Flag::execute())
            {
                // Update Flags
                ...
            }
        }
    }
};


We also use a Mask Table for the mapping between mask
patterns and templates. It also maintains a mapping between
mask patterns and functions corresponding to those
templates.
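
A mask lookup of this kind can be sketched as follows. The sketch assumes masks are written most significant bit first using '0', '1' and 'x', with any other character treated as a separator; `maskMatches`, `MaskEntry`, and `determineClass` are hypothetical names, and the class name string stands in for the template or function that a real Mask Table entry would reference.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Returns true when every '0'/'1' in the mask equals the corresponding
// instruction bit; 'x' matches either value. Characters other than
// '0', '1', and 'x' (spaces, dashes) are treated as separators.
inline bool maskMatches(const std::string& mask, std::uint32_t inst) {
    int bit = 31;                        // masks are written MSB first
    for (char c : mask) {
        if (c != '0' && c != '1' && c != 'x') continue;
        if (bit < 0) return false;       // more than 32 pattern symbols
        std::uint32_t actual = (inst >> bit) & 1u;
        if (c == '0' && actual != 0u) return false;
        if (c == '1' && actual != 1u) return false;
        --bit;
    }
    return bit == -1;                    // exactly 32 symbols were consumed
}

// One Mask Table row: a mask pattern and the instruction class it selects.
struct MaskEntry {
    std::string mask;
    std::string className;   // stand-in for a template or function reference
};

// Linear scan in table order; returns an empty string when nothing matches.
inline std::string determineClass(const std::vector<MaskEntry>& table,
                                  std::uint32_t inst) {
    for (const MaskEntry& e : table) {
        if (maskMatches(e.mask, inst)) return e.className;
    }
    return std::string();
}
```

Because the scan returns the first matching row, the order of the table entries matters whenever two masks can match the same instruction word.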

This instruction abstraction technique is used to generate 
aggressively optimized decoded instructions as described in 
Section 3.2. 

3.2 Instruction Decoder 

Algorithm 1 decodes one binary instruction at a time to 
generate the decoded program for the input application. 
For each instruction in the application program it selects 
the appropriate template using Algorithm 2. It generates 
a customized template for the instruction using the appropriate
parameter values. Algorithm 3 briefly describes the
customized template generation process. Finally, the customized
template is instantiated and appended to the temporary
program TempProgram. The TempProgram is fed
to a C++ compiler that performs the necessary optimizations
to take advantage of the partial evaluation technique, described
in Section 3.1, to produce the DecodedProgram. The
DecodedProgram is loaded into instruction memory, which is
a data structure separate from main memory. While the
main memory holds the original program data and instruc- 
tion binaries, each cell of instruction memory holds a pointer 
to the optimized functionality as well as the instruction bi- 
nary. The instruction binary is used to check the validity of 
the decoded instruction during run-time. 

Algorithm 3 describes the template customization pro- 
cess. The algorithm's basic idea is to extract the values 
from specific fields of the binary instruction (e.g., opcode, 
operand etc.) and assign those values to the template. We 
maintain the information for each class of instructions: templates,
field formats, and mask patterns. This information
can be derived from the processor specification described using
an Architecture Description Language such as LISA [1],
EXPRESSION [7] and nML [12].



We illustrate the power of our technique to generate an 
optimized decoded instruction using a single data processing 
instruction. We show the binary as well as the assembly of 
the instruction below. 

Binary:   1110 | 000 | 0100 | 0 | 0010 | 0001 | 01010 | 00 | 0 | 0011

          (cond | 000 | op | S | Rn | Rd | shift immed | shift | 0 | Rm)

Assembly: ADD r1, r2, r3 LSL #10

          (op{<cond>}{S} Rd, Rn, Rm shift #<immed>)
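
The field extraction for this instruction can be sketched as follows. Concatenating the binary fields above gives the 32-bit word 0xE0821503; the helper `bits`, the struct `DataProcFields`, and the function `decodeDataProc` are hypothetical names introduced for this sketch, and the bit positions follow the field layout shown above for a register operand with an immediate shift amount.

```cpp
#include <cstdint>

// Extracts bits [hi:lo] (inclusive, bit 0 = LSB) of a 32-bit word.
constexpr std::uint32_t bits(std::uint32_t word, unsigned hi, unsigned lo) {
    return (word >> lo) &
           ((hi - lo == 31u) ? 0xFFFFFFFFu : ((1u << (hi - lo + 1u)) - 1u));
}

struct DataProcFields {
    std::uint32_t cond, opcode, s, rn, rd, shiftImm, shiftType, rm;
};

// Splits a data processing instruction into its fields.
constexpr DataProcFields decodeDataProc(std::uint32_t inst) {
    return DataProcFields{
        bits(inst, 31, 28),   // cond
        bits(inst, 24, 21),   // op
        bits(inst, 20, 20),   // S
        bits(inst, 19, 16),   // Rn
        bits(inst, 15, 12),   // Rd
        bits(inst, 11, 7),    // shift immed
        bits(inst, 6, 5),     // shift
        bits(inst, 3, 0)};    // Rm
}
```

For 0xE0821503 the fields come out as cond = 1110, op = 0100, S = 0, Rn = 2, Rd = 1, shift immed = 10, shift = 00, and Rm = 3, which matches ADD r1, r2, r3 LSL #10.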

The DetermineTemplate function returns the DataPro- 
cessing template (shown in Example 1) for this binary in- 
struction. The CustomizeTemplate function generates the 
following customized template for the execute function. 

void DataProcessing<Always, Add, False,
                    SftOper<Reg, ShiftLeft, Imm>>::execute()
{
    if (Always::execute()) {
        _dest = Add::execute(_src1, _sftOperand.getValue());
        if (False::execute()) {
            // Update Flags
            ...
        }
    }
}


Algorithm 1: Instruction Decoding

Inputs: Application Program Appl (Binary), Mask Table maskTable.
Output: Decoded Program DecodedProgram.

Begin
  TempProgram = {}
  foreach binary instruction inst with address addr in Appl
    template = DetermineTemplate(inst, maskTable)
    template_inst = CustomizeTemplate(template, inst)
    newStr = "InstMemory[addr] = new template_inst"
    TempProgram = AppendInst(TempProgram, newStr)
  endfor
  DecodedProgram = Compile(TempProgram)
End
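
The text-generation loop of Algorithm 1 can be sketched as follows. This is an illustration only: `BinaryInst`, `Customizer`, and `buildTempProgram` are hypothetical names, the customizer callback stands in for DetermineTemplate followed by CustomizeTemplate, and the final Compile step (running a C++ compiler on the generated text) is outside the sketch.

```cpp
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

struct BinaryInst {
    std::uint32_t addr;   // address of the instruction in the application
    std::uint32_t word;   // 32-bit instruction binary
};

// Maps an instruction word to the text of its customized template,
// standing in for DetermineTemplate followed by CustomizeTemplate.
using Customizer = std::function<std::string(std::uint32_t)>;

// Builds the text of the temporary program: one statement per instruction
// that stores a new instance of the customized template at its address.
inline std::string buildTempProgram(const std::vector<BinaryInst>& app,
                                    const Customizer& customize) {
    std::ostringstream out;
    for (const BinaryInst& inst : app) {
        out << "InstMemory[" << inst.addr << "] = new "
            << customize(inst.word) << ";\n";
    }
    return out.str();
}
```

Each generated line names a fully parameterized template, so the C++ compiler that later processes the text sees all field values as compile-time constants.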



After compilation using a C++ compiler, several optimizations
occur on the execute() function. The Always::execute()
function call is evaluated to true. Hence, the check is
removed. Similarly, the function call False::execute() is
evaluated to false. As a result, the branch and the statements
inside it are removed by the compiler. Finally, the two
function calls Add::execute() and _sftOperand.getValue() get
inlined as well. Consequently, the execute() function gets
optimized into one single statement that adds the first source
operand to the shifted second operand.
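
The effect can be reproduced with a small self-contained sketch. The names Always, Add, False, and DataProcessing follow the text; everything else is an assumption made for illustration: the shifter operand is simplified to a hypothetical `RegShiftLeftImm` template with the shift amount as a compile-time constant, the fields are public members, and the virtual base class is omitted.

```cpp
#include <cstdint>

struct Always { static constexpr bool execute() { return true; } };
struct False  { static constexpr bool execute() { return false; } };
struct Add {
    static constexpr std::uint32_t execute(std::uint32_t a, std::uint32_t b) {
        return a + b;
    }
};

// Simplified shifter operand: a register value shifted left by a
// compile-time immediate (an assumption of this sketch).
template <unsigned ShiftAmount>
struct RegShiftLeftImm {
    std::uint32_t rm = 0;   // value of the Rm register
    constexpr std::uint32_t getValue() const { return rm << ShiftAmount; }
};

template <class Cond, class Op, class Flag, class SftOper>
struct DataProcessing {
    std::uint32_t src1 = 0;
    std::uint32_t dest = 0;
    bool flagsUpdated = false;
    SftOper sftOperand;

    void execute() {
        if (Cond::execute()) {               // constant true for Always
            dest = Op::execute(src1, sftOperand.getValue());
            if (Flag::execute()) {           // constant false for False
                flagsUpdated = true;         // placeholder for the flag update
            }
        }
    }
};

// Customized instance for: ADD r1, r2, r3 LSL #10
using AddLsl10 = DataProcessing<Always, Add, False, RegShiftLeftImm<10>>;
```

With optimization enabled, a compiler is free to reduce `AddLsl10::execute()` to the equivalent of `dest = src1 + (rm << 10)`, since both conditions are compile-time constants and the remaining calls are trivially inlinable.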



Algorithm 2: DetermineTemplate

Inputs: Instruction inst (Binary), and Mask Table maskTable.
Output: Template.

foreach entry <mask, template> in maskTable
  if mask matches inst return template
endfor



Algorithm 3: CustomizeTemplate

Inputs: Template template, Instruction inst (Binary).
Output: Template with Parameter Values.

switch instClassOf(inst)
  case Data Processing:
    switch (inst[31:28])
      case 1110: condition = Always
      ...
    endswitch
    switch (inst[24:21])
      case 0100: opcode = Add
      ...
    endswitch
    ...
  endcase /* Data Processing */
  case Branch: ... endcase
  ...
  case Multiple LoadStore: ... endcase
  case Software Interrupt: ... endcase
  case Swap: ... endcase
endswitch






Furthermore, in many ARM instructions, the shifter oper- 
and is a simple register or immediate. Therefore, the shift 
operation is actually a no shift operation. Although the 
manual says that the case is equivalent to shift left zero, we 
use a no shift operation that enables further optimization. 
In this way, an instruction similar to the above example
would have only one operation in its execute() method.

3.3 Simulation Engine 

Due to compile time decoding and our instruction abstrac- 
tion technique, the simulation engine is fast and simple. In 
this section we briefly describe the three basic steps in the
simulation kernel, viz., fetch, decode (if necessary), and execute.

The simulation engine fetches one decoded instruction at a 
time. As mentioned earlier, each instruction entry contains
two fields, viz., the binary and a pointer to the optimized
functionality for the instruction. Before executing the fetched
instruction, it is necessary to verify that the current
instruction is valid, i.e., that it has not been modified during run
time. The simulation engine compares the binary part of
the current instruction having address addr with the binary
instruction of the application program stored in memory at
address addr. If they are equal then the decoded instruction 
is valid and the engine executes the optimized functionality 
referenced by the instruction. 
However, if the instruction is modified then the modified
binary is re-decoded. This decoding is similar to the one 
performed in the compile time decoding of instructions ex- 
cept that it uses a pointer to an appropriate function. While 
we develop the templates for each class of instructions, we 
also develop one function for each class. The mask table 
mentioned in Section 3.1 maintains the mapping between a 
mask for every class of instruction and the function for that 
class. The decoding step during run time consults the mask 
table and determines the function pointer. It also updates 
the instruction memory with the decoded instruction i.e., it 
writes the new function pointer in that address. 

The execution process is very simple. It simply invokes 
the function using the pointer specified in the decoded in- 
struction. 
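
The three steps can be sketched as follows for a hypothetical toy target. `Cpu`, `InstEntry`, `redecode`, and `step` are names introduced for this sketch, and the two toy behaviors stand in for the per-class functions; only the structure (binary plus function pointer per entry, compare, re-decode on mismatch, call through the pointer) follows the description above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cpu { std::uint32_t acc = 0; };

using ExecFn = void (*)(Cpu&, std::uint32_t);

// One cell of instruction memory: the binary the entry was decoded from
// and a pointer to the functionality for that instruction.
struct InstEntry {
    std::uint32_t binary;
    ExecFn exec;
};

// Hypothetical toy behaviors selected by the top byte of the word.
inline void execAddImm(Cpu& cpu, std::uint32_t word) { cpu.acc += word & 0x00FFFFFFu; }
inline void execNop(Cpu&, std::uint32_t) {}

// Run-time decoder, used only for modified instructions.
inline ExecFn redecode(std::uint32_t word) {
    if ((word >> 24) == 0x01u) return execAddImm;
    return execNop;
}

// Executes the instruction at index addr: fetch the decoded entry, check
// it against main memory, re-decode if the word changed, then execute.
// Returns true when a re-decode was needed.
inline bool step(std::vector<InstEntry>& instMemory,
                 const std::vector<std::uint32_t>& mainMemory,
                 std::size_t addr, Cpu& cpu) {
    InstEntry& entry = instMemory[addr];             // fetch
    bool modified = entry.binary != mainMemory[addr];
    if (modified) {                                  // decode (if necessary)
        entry.binary = mainMemory[addr];
        entry.exec = redecode(entry.binary);
    }
    entry.exec(cpu, entry.binary);                   // execute
    return modified;
}
```

The validity check costs one comparison per executed instruction, and a modified instruction is re-decoded only once because the entry is updated in place.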

Since the number of instructions modified during run time
is usually negligible, using a general unoptimized function
for simulating them does not degrade the performance. It is
important to note that since the engine is still very simple,
we can easily use traditional interpretive techniques for
executing modified instructions while the instruction set
compiled technique can be used for the rest (majority) of the
instructions. Thus, our instruction set compiled simulation
(IS-CS) technique combines the full flexibility of interpretive
simulation with the speed of compiled simulation.

4. EXPERIMENTS 

We evaluated the applicability of our IS-CS technique using
various processor models. In this section, we present simulation
results using a popular embedded processor, ARM7
[17], to demonstrate the usefulness of our approach. 

4.1 Experimental Setup 

The ARM7 processor is a RISC machine with a fairly complex
instruction set. We used arm-linux-gcc for generating
target binaries for ARM7. Performance results of the different
generated simulators were obtained using a Pentium 3
at 1 GHz with 512 MB RAM running Windows 2000. The
generated simulator code is compiled using the Microsoft Visual
Studio .NET compiler with all optimizations enabled.
The same C++ compiler is used for compiling the decoded
program as well.

In this section we show the results using two application
programs: adpcm and jpeg. We have used these two benchmarks
to be able to compare our simulator performance with
previously published results [1].

The arm-linux-gcc with the -static option generates approximately
50K instructions for the benchmarks. When all
optimizations are enabled in the MS VC++ compiler, it takes
about 15 minutes to compile and generate the decoded program.

4.2 Results 

Figure 4 shows the simulation performance using our tech- 
nique. The results were generated using an ARM7 model. 
The first bar shows the simulation performance of our tech- 
nique with run-time program modification check enabled. 
Our technique can perform better if it is known prior to ex- 
ecution that the program is not self modifying. The second 
bar represents the simulation performance of running the 
same benchmark by disabling the run-time check. We could 
achieve up to 9% performance improvement by disabling the
instruction modification detection and update mechanism.

Figure 4: Instruction Set Compiled Simulation



We are able to perform simulation at a speed of up to 12
MIPS using the P3 (1.0 GHz) host machine. To the best of
our knowledge, the best performance of a simulator having
the flexibility of interpretive simulation has been JIT-CCS
[1]. The JIT-CCS technique could achieve a performance of
up to 8 MIPS on an Athlon at 1.2 GHz with 768 MB RAM.
Since we did not have access to a similar machine, our
comparisons are based on results run on a slower machine
(Pentium 3 at 1 GHz with 512 MB RAM) versus previous results
[1] on a faster machine (Athlon at 1.2 GHz with 768 MB
RAM). On the jpeg benchmark our IS-CS technique performs
40% better than the JIT-CCS technique. The same trend
(30% improvement) is observed in the case of the adpcm benchmark
as well. Clearly, these are conservative numbers since our
experiments were run on a slower machine.



Figure 5: Effect of Different Optimizations (legend: SimpleScalar, FunctionPointer, ISCompiled)

There are two reasons for the superior performance of our 
technique: moving the time consuming decoding out of the 
execution loop, and generating aggressively optimized code 
for each instruction. The effects of using these techniques 
are demonstrated in Figure 5. The first bar in the chart is 
the simulation performance of running the benchmarks on 
an ARM7 model of Simplescalar [3] that does not use any 
of these techniques. The second bar shows the effect of doing
the decoding process at compile time and using function
pointers during execution. The use of function pointers in
the decoded instruction is similar to [1]. We are able to
achieve better results than JIT-CCS [1] even in this category
because the JIT-CCS technique performs decoding
of instructions during run time (at least once) while we
do it during compile time. Besides, they use a software
caching technique to reuse the decoded instruction but
we do not. The last bar is our simulation approach that uses
both techniques: compile-time decoding and using templates
to produce optimized code.

We have demonstrated that instruction set compiled sim- 
ulation coupled with our instruction abstraction technique 
delivers the performance of compiled simulation while main- 
taining the flexibility of interpretive simulation. Our simu- 
lation technique delivers better performance than other sim- 
ulators in this category, as demonstrated in this section. 

5. SUMMARY 

In this paper we presented a novel technique for instruction
set simulation. Due to the simple interpretive simulation
engine and optimized pre-decoded instructions, our instruction
set compiled simulation (IS-CS) technique achieves
the performance of compiled simulation while maintaining 
the flexibility of interpretive simulation. The performance 
can be further improved by disabling the run-time change 
detection which is suitable for many applications that are 
not self modifying. 

The IS-CS technique achieves its superior performance 
for two reasons: moving time-consuming decode to compile 
time, and using templates to produce aggressively optimized 
code for each instance of an instruction. We demonstrated a
performance improvement of up to 40% over the best published
results on an ARM7 model.

Future work will concentrate on using this technique for 
modeling other real world architectures using an architecture
description language to demonstrate the retargetability
of this approach. 